Import Packages

The code below imports the packages used in this assignment.

Author’s Note

The author is aware that some packages loaded here do not directly contribute to the report: several plots and packages were trialled but proved unsatisfactory, and due to time constraints the unused packages were not removed. Thank you for your kind understanding.

# Import Packages
suppressMessages({
  library(tidyverse)
  library(lubridate)    # Date time objects
  library(ggplot2)      # Plotting
  library(extrafont)    # Custom fonts
  library(scales)       # Map scale aesthetics
  library(ggthemes)     # Custom themes package
  library(bbplot)       # Custom bbc theme
  library(ggridges)     # Ridgeline plot package
  library(viridis)      # Color Maps
  library(hrbrthemes)   # Custom themes
  library(reshape2)     # Reshape data
  library(igraph)       # Graph Networks
  library(ggraph)       # Graph Visualisation
  library(ggforce)      # Additional Graph Functionality
  library(circlize)     # Chord Diagram
  library(stringr)      # Wrapper for String Operations
  library(networkD3)    # Sankey Diagram
  library(RColorBrewer) # Color Palettes
})
print('loaded')
## [1] "loaded"





Import Data

The code below imports the downloaded data.

# Import Data
suppressMessages({
  resale_flat_prices_2017 <- read_csv(
    'data/ResaleflatpricesbasedonregistrationdatefromJan2017onwards.csv'
    )
})
print('loaded')
## [1] "loaded"




Q1 - HDB Resale Flat Prices

Question 1A-1

In 2021, there were 261 HDB flats transacted at or above $1m. Compute how many such transactions there were in the last year.

Aim

Find the number of transactions at or above $1m in 2023.

Approach

This involves trimming the dataset down with several conditional filters.

  1. Extract list of transactions in 2023
  2. Extract list of transactions >= $1m
  3. Obtain the total count of resultant dataframe.
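As a minimal sketch, the three steps can be combined into a single dplyr pipeline; the tiny tibble below is made up purely for illustration, while the real filters run on the full dataset.

```r
library(dplyr)

# Toy data standing in for the resale dataset (values are made up)
toy <- tibble::tibble(
  month        = c("2023-01", "2023-06", "2022-12"),
  resale_price = c(1200000, 950000, 1100000)
)

# Steps 1-3 as one pipeline: filter to 2023, keep >= $1m, count rows
n_million_2023 <- toy %>%
  filter(grepl("2023", month), resale_price >= 1000000) %>%
  nrow()
n_million_2023  # 1
```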

Extract Transactions in 2023

The code below extracts the list of transactions that occurred in 2023.

#Filter 2023 transactions
transactions_2023 <- resale_flat_prices_2017 %>% 
  filter(grepl("2023", month))

Extract Transactions >= $1m in 2023

The code below filters the dataframe for transactions where resale_price >= $1m

#Extract >= $1m
transactions_2023_million <- transactions_2023 %>%
  filter(resale_price >= 1000000)

Extract Total Count

The code below gets the total count of the transactions_2023_million dataframe.

#Count
count_2023_million <- nrow(transactions_2023_million)
count_2023_million
## [1] 470

Answer

There were 470 resale transactions in 2023 with prices greater than or equal to $1m.





Question 1A-2

HDB resale prices rose 0.8 per cent in December 2021 from the previous month. Do the same comparison for the same period in the last year.

Aim

Find the percentage change in resale price between December 2023 and November 2023.

Approach

  1. Extract the list of transactions in December 2023 and November 2023.
  2. Calculate the mean of both lists.
  3. Calculate the percentage difference.
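The percentage difference in step 3 can be captured in a small helper function (a sketch; the function name is illustrative, not from the original code):

```r
# Percentage change from 'old' to 'new', in percent
pct_change <- function(new, old) (new - old) / old * 100

pct_change(110, 100)  # 10
```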

Extract Transactions

The code below extracts the list of transactions for December and November 2023.

# Filter rows where 'month' equals '2023-12'
transactions_2023_12 <- transactions_2023 %>%
  filter(month == "2023-12")

transactions_2023_11 <- transactions_2023 %>%
  filter(month == "2023-11")

Calculate Mean

This section calculates the mean resale price for each month.

# Calculate the mean of the 'resale_price' column
mean_resale_price_12_2023 <- mean(transactions_2023_12$resale_price, na.rm = TRUE)
mean_resale_price_11_2023 <- mean(transactions_2023_11$resale_price, na.rm = TRUE)
mean_resale_price_12_2023
## [1] 578587.3
mean_resale_price_11_2023
## [1] 581146.7

Calculate Percentage Difference

This section calculates the percentage difference between both months.

# Calculate the percentage difference
percentage_difference_1a2 <- ((mean_resale_price_12_2023 - mean_resale_price_11_2023) / mean_resale_price_11_2023) * 100
print(percentage_difference_1a2)
## [1] -0.4404132

Answer

HDB resale prices fell 0.44% in December 2023 from the previous month.





Question 1A-3

HDB resale prices in December 2021 were 13.6 per cent higher than a year ago. Compute the same but for the December in the last year.

Aim

Find the percentage change in resale price between December 2023 and December 2022.

Approach

  1. Extract the list of transactions in December 2023 and December 2022.
  2. Calculate the mean of both lists.
  3. Calculate the percentage difference.

Extract Transactions

The code below extracts the list of transactions for December 2022.

# Filter rows for 2022 transactions
transactions_2022 <- resale_flat_prices_2017 %>% 
  filter(grepl("2022", month))

# Filter rows where 'month' equals '2022-12'
transactions_2022_12 <- transactions_2022 %>%
  filter(month == "2022-12")

Calculate Mean

This section calculates the mean resale price for that month.

mean_resale_price_12_2022 <- mean(transactions_2022_12$resale_price, na.rm = TRUE)
mean_resale_price_12_2022
## [1] 549908.5

Calculate Percentage Difference

This section calculates the percentage difference between both months.

# Calculate the percentage difference
percentage_difference_1a3 <- ((mean_resale_price_12_2023 - mean_resale_price_12_2022) / mean_resale_price_12_2022) * 100
print(percentage_difference_1a3)
## [1] 5.215193

Answer

HDB resale prices in December 2023 were 5.22 per cent higher than a year ago in December 2022.





Question 1A-4

There were 2,429 HDB flats sold in December 2021. Compute the same for December last year.

Aim

Find the number of transactions in December 2023.

Approach

  1. Extract the list of transactions in December 2023.
  2. Calculate the count of rows.

Calculate Count of Rows

This section counts the number of transactions.

# Calculate the count of rows 
num_of_transactions_12_23 <- nrow(transactions_2023_12)
num_of_transactions_12_23
## [1] 2009

Answer

There were 2,009 HDB flats sold in December 2023.





Question 1A-5

Replicate the first plot in the article (“HDB resale volume”, the one with the blue columns), including the approximate style (it does not have to be exactly the same), but for the data in the last year.

Aim

Plot a graph showing the HDB resale volume from Dec 2022 - Dec 2023 in the Straits Times Format.

Approach

  1. Obtain a list of transactions from Dec 2022 - Dec 2023.
  2. For each month, calculate the number of transactions and store it as a new dataframe.
  3. Plot the graph.

Data Transformation for Dates

First we convert the ‘month’ column from the original dataframe to a date format to allow for easier filtering later on.

# Convert 'month' column to date format
resale_flat_prices_2017$month <- ymd(paste0(resale_flat_prices_2017$month, "-01"))

Filter to Analysis Period

Obtain a list of transactions from Dec 2022 - Dec 2023.

# Extract list
transactions_2212_2312 <- resale_flat_prices_2017 %>%
  filter(month >= as.Date("2022-12-01") & month <= as.Date("2023-12-01"))

Calculate Number of Transactions per Month

This section calculates the number of transactions for each month.

# Compute the number of transactions for each month

transactions_per_month <- transactions_2212_2312 %>%
  group_by(month) %>%
  summarise(number_of_transactions = n(), .groups = "drop")

head(transactions_per_month)

Graph Pre-Processing

This section imports custom fonts and sets specific colors to match the desired graph.

suppressMessages({
  # Create a color vector with a default blue color, and a darker shade for the first and last bars
  color_vector <- rep("#c4ddf3", nrow(transactions_per_month))
  color_vector[1] <- "#599ddc" # First bar 
  color_vector[nrow(transactions_per_month)] <- "#599ddc" # Last bar 

  # Specify the path to the directory
  font_path <- "C:/Users/colin/OneDrive/Desktop/data science mod/working/fonts"

  # Import the fonts 
  font_import(paths = font_path, prompt = FALSE)

  # Load fonts 
  loadfonts(device = "win") # Use "pdf" for PDF output on non-Windows systems

  # List the fonts 
  fonts()
})
## [1] "EconSansCndBol"    "EconSansCndBolIta" "EconSansCndLig"   
## [4] "EconSansCndLigIta" "EconSansCndMed"    "EconSansCndMedIta"
## [7] "EconSansCndReg"    "EconSansCndRegIta"

Plot

# Plot the data 
p <- ggplot(transactions_per_month, aes(x = month, y = number_of_transactions)) +
  geom_col(color = "white", fill = color_vector) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0, margin = margin(b = 10), family = "EconSansCndBol", colour = "#666666"),  
    plot.subtitle = element_text(hjust = 0, family = "EconSansCndReg", colour = "#666666"), 
    axis.text.x = element_text(vjust = 1, hjust = 1, family = "EconSansCndReg", colour = "#666666"),  
    axis.text.y = element_text(family = "EconSansCndLig", colour = "#666666"), # Lighter grey color for Y axis text
    legend.position = "none",
    panel.grid.major.x = element_blank(), # Ensure no vertical grid lines
    panel.grid.minor.x = element_blank(), # Ensure no minor vertical grid lines
    panel.grid.major.y = element_line(color = "#d3d3d3"), 
    panel.background = element_blank(), # Clean background
    panel.grid.minor.y = element_blank() # Remove minor horizontal grid lines 
  ) +
  labs(
    title = "HDB Resale Volume",
    subtitle = "No. of transactions",
    x = "",
    y = ""
  ) +
  scale_y_continuous(
    labels = scales::comma, 
    breaks = seq(0, 3000, by = 500), 
    limits = c(NA, 3000) # Increase the y-axis limit to 3000
  ) +
  scale_x_date(
    date_labels = "%b", 
    date_breaks = "1 month"
  ) +
  geom_text(
    data = subset(transactions_per_month, format(month, "%m") == "12"), 
    aes(label = scales::comma(number_of_transactions)), 
    vjust = -0.3, 
    fontface = "bold", 
    colour = "#666666" 
  )

# Render 
p





Q1B - Exploratory Data Analysis

Before starting on the analysis, we scope out the dataset to understand its structure and get a feel of the data. We also conduct a basic check for null values.

Dataset Overview

Display the first rows to get a general feel of the data.

# View first rows
head(resale_flat_prices_2017)

Observation

The dataset is arranged in ascending chronological and alphabetical order, starting with January 2017 and the town of Ang Mo Kio.

Next we check the last 5 entries of the dataset to understand how updated it is.

# View bottom 5 rows
tail(resale_flat_prices_2017,5)

Observation

The latest entry is from February 2024, and the current date is 23 February 2024. A quick check on the beta.data.gov.sg website shows this dataset was last updated 8 hours ago, so it can be assumed to be current up to this week.

Dataset Size

We obtain the shape of the dataset to understand the size of the dataset we will be working with.

# Shape of dataset
dim(resale_flat_prices_2017)
## [1] 173334     11

Observation

The dataset has 173,334 rows and 11 columns, which is fairly large to work with.

Dataset Columns

Next we print all column names to understand the variables we will be working with.

# Print column names
names(resale_flat_prices_2017)
##  [1] "month"               "town"                "flat_type"          
##  [4] "block"               "street_name"         "storey_range"       
##  [7] "floor_area_sqm"      "flat_model"          "lease_commence_date"
## [10] "remaining_lease"     "resale_price"

Observation

The dataset can be broken down into 5 main categories namely: time, location, flat details, lease and price.

Dataset Summary

Next we print a summary of the dataset to understand the data types of each column.

# Print summary
summary(resale_flat_prices_2017)
##      month                town            flat_type            block          
##  Min.   :2017-01-01   Length:173334      Length:173334      Length:173334     
##  1st Qu.:2019-01-01   Class :character   Class :character   Class :character  
##  Median :2020-12-01   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2020-10-01                                                           
##  3rd Qu.:2022-07-01                                                           
##  Max.   :2024-02-01                                                           
##  street_name        storey_range       floor_area_sqm    flat_model       
##  Length:173334      Length:173334      Min.   : 31.00   Length:173334     
##  Class :character   Class :character   1st Qu.: 82.00   Class :character  
##  Mode  :character   Mode  :character   Median : 93.00   Mode  :character  
##                                        Mean   : 97.24                     
##                                        3rd Qu.:112.00                     
##                                        Max.   :249.00                     
##  lease_commence_date remaining_lease     resale_price    
##  Min.   :1966        Length:173334      Min.   : 140000  
##  1st Qu.:1985        Class :character   1st Qu.: 368000  
##  Median :1996        Mode  :character   Median : 463000  
##  Mean   :1996                           Mean   : 493357  
##  3rd Qu.:2009                           3rd Qu.: 587000  
##  Max.   :2022                           Max.   :1568888

Observation

The dataset has two data types: character strings and numeric values. However, some columns will need transformation later on to convert their string values into numeric ones (e.g. remaining_lease).
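As an illustration of the transformation that would be needed, the sketch below parses a remaining_lease string into a numeric number of years, assuming the "X years Y months" format used in this dataset; the helper name is illustrative.

```r
library(stringr)

# Convert e.g. "61 years 04 months" to 61.33 years; a missing months
# part (e.g. "70 years") is treated as zero months.
lease_to_years <- function(x) {
  yrs <- as.numeric(str_extract(x, "\\d+(?= year)"))
  mos <- as.numeric(str_extract(x, "\\d+(?= month)"))
  mos[is.na(mos)] <- 0
  yrs + mos / 12
}

lease_to_years("61 years 04 months")
```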

Check for Null Values

The code below checks the dataset for null values.

# null value check
na_check <- colSums(is.na(resale_flat_prices_2017)) > 0
print(na_check)
##               month                town           flat_type               block 
##               FALSE               FALSE               FALSE               FALSE 
##         street_name        storey_range      floor_area_sqm          flat_model 
##               FALSE               FALSE               FALSE               FALSE 
## lease_commence_date     remaining_lease        resale_price 
##               FALSE               FALSE               FALSE

Observation

The dataset does not appear to have any null values for now. However, we cannot assume the values are error-free: missing values may still be encoded as sentinels such as 0 or 9999.
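A quick way to probe for such sentinel values is sketched below; the function name and sentinel list are illustrative, and the price vector is made up.

```r
# Count suspicious placeholder values in a numeric column,
# since "no NAs" does not rule out sentinels like 0 or 9999
count_sentinels <- function(x, sentinels = c(0, 9999)) {
  sum(x %in% sentinels, na.rm = TRUE)
}

prices <- c(140000, 368000, 0, 463000, 9999)  # made-up values
count_sentinels(prices)  # 2
```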





Analysis Overview Summary

The section below lists an overview of the research questions asked, the insights found, and the corresponding type of visualization used.

Research Question 1

Question:

Is there a relationship between floor area and resale price?

Type of Visualisation:

Boxplot and Violin

Insight:

From the boxplot, it can be observed that outliers exist within every category except the 200-250 sqm range, indicating that while floor area affects the resale price, other stronger factors, possibly location, prompt buyers to pay more for the same floor area.

Furthermore, in the violin plot, the increasingly elongated shape as the floor area category increases suggests that as floor area grows, so does the variance of resale prices buyers are willing to pay.

Research Question 2

Question:

Does the storey range of a flat have an impact on its resale price?

Type of Visualisation:

Bar graph for each flat type and storey range.

Insight:

Multi-generation and one-room flats have a low storey range of only up to 12 floors. One would expect resale price to rise with storey range; however, beyond a certain point, usually around the top 10% of the storey range for that flat type, resale prices fall, as seen for the 2-room, 3-room, 5-room and executive flats.

Research Question 3

Question:

Is there a relationship between location and flat type?

Type of Visualisation:

Heatmap.

Insight:

Darker colors indicate higher average resale prices, while lighter colors represent lower, more affordable ones. From the heatmap, it can be deduced that location does matter for the resale price of the same flat type, as seen in how the Central Area and Queenstown outperform their peers for each flat type.





Research Question 1 - Is there a relationship between floor area and resale price?

Approach

To investigate this, we would need to:

  1. Determine the time period of analysis
  2. Find out how many different flat types there are within this time period
  3. Filter the data with the above two conditions, group by flat type and find the median resale price.
  4. Plot the necessary visualisation
  5. Adjust plot for readability as necessary

Plot

First we plot a scatter plot with a regression line to understand whether there is a trend between floor area and resale price across flat types.

# Plot
ggplot(data = resale_flat_prices_2017, aes(x = floor_area_sqm, y = resale_price)) +
  geom_point(aes(color = flat_type), alpha = 0.6) +  # Points with updated color scheme
  geom_smooth(method = "lm", se = FALSE, color = "black") +  # Add regression line
  labs(x = "Floor Area (sqm)", y = "Resale Price", title = "Relationship Between Floor Area and Resale Price") +
  theme_minimal() +
  scale_color_brewer(palette = "Set3")  
## `geom_smooth()` using formula = 'y ~ x'

Observation

There seems to be a positive correlation between floor area and resale price; however, the distinction between flat types is unclear due to the large number of data points. It might be good to further segment floor area for better readability.

Understanding Distribution of Floor Area

First we call a summary of the floor area to understand the median, minimum, maximum and quartiles.

# Summary of 'floor_area_sqm' column
summary(resale_flat_prices_2017$floor_area_sqm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   31.00   82.00   93.00   97.24  112.00  249.00

Next we plot a boxplot to visualise the distribution of the floor area.

# Boxplot of 'floor_area_sqm' column
ggplot(resale_flat_prices_2017, aes(y = floor_area_sqm)) +
  geom_boxplot(fill = "#b3b3b3", color = "#404040", outlier.color = "#606060") +
  coord_flip() +  
  theme_minimal() +
  labs(title = "Boxplot of Floor Area (sqm)",
       y = "Floor Area (sqm)") +
  theme(plot.title = element_text(hjust = 0.5),  
        axis.title.y = element_blank(),  
        axis.text = element_text(color = "#333333"),
        axis.title = element_text(color = "#333333"))

Observation

With the minimum floor area at 31 sqm, the maximum at 249 sqm, and the majority of values between 82 and 112 sqm, it might be good to segment the floor areas into breaks of 50 to make the visualization cleaner.

Adjusting Floor Area Intervals

This section splits the floor area into intervals of 50 sqm.

resale_flat_prices_2017$floor_area_interval <- cut(resale_flat_prices_2017$floor_area_sqm, 
                              breaks = c(0, 50, 100, 150, 200, 250),
                              include.lowest = TRUE,
                              right = FALSE)

Plot

# Boxplot
ggplot(data = resale_flat_prices_2017, aes(x = floor_area_interval, y = resale_price, fill = floor_area_interval)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Set3") +  
  labs(x = "Floor Area Interval (sqm)", y = "Resale Price", title = "Resale Price by Floor Area Interval") +
  theme(axis.text.x = element_text(hjust = 0.5), 
        plot.title = element_text(hjust = 0.5))  

# Violin Plot
ggplot(data = resale_flat_prices_2017, aes(x = floor_area_interval, y = resale_price, fill = floor_area_interval)) +
  geom_violin(trim = FALSE) +  
  scale_fill_brewer(palette = "Set3") + 
  labs(x = "Floor Area Interval (sqm)", y = "Resale Price", title = "Resale Price Distribution by Floor Area Interval") +
  theme(axis.text.x = element_text(hjust = 0.5),  
        plot.title = element_text(hjust = 0.5)) 

Observation / Insight

From the boxplot, it can be observed that outliers exist within every category except the 200-250 sqm range, indicating that while floor area affects the resale price, other stronger factors, possibly location, prompt buyers to pay more for the same floor area.

Furthermore, in the violin plot, the increasingly elongated shape as the floor area category increases suggests that as floor area grows, so does the variance of resale prices buyers are willing to pay.





Research Question 2 - Does the storey range of a flat have an impact on its resale price?

Approach

  1. Check the unique values of the storey ranges.
  2. Calculate the average resale price by storey range.
  3. Plot to visualise.
  4. Further aggregate by flat type.
  5. Plot to see median resale price by storey range and flat type.

Check unique values

This section checks for the unique values of the ‘storey_range’ column.

# Unique Values
unique(resale_flat_prices_2017$storey_range)
##  [1] "10 TO 12" "01 TO 03" "04 TO 06" "07 TO 09" "13 TO 15" "19 TO 21"
##  [7] "22 TO 24" "16 TO 18" "34 TO 36" "28 TO 30" "37 TO 39" "49 TO 51"
## [13] "25 TO 27" "40 TO 42" "31 TO 33" "46 TO 48" "43 TO 45"

Observation

The storey ranges of the flats go up to 51 storeys, binned in intervals of 3 storeys per category.
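If the storey ranges ever need to be treated as numbers rather than strings (e.g. for sorting or regression), a numeric midpoint can be derived from each label; the helper below is a sketch with an illustrative name.

```r
# Turn a "lo TO hi" storey-range label into its numeric midpoint
storey_mid <- function(x) {
  lo <- as.numeric(sub(" TO .*", "", x))
  hi <- as.numeric(sub(".* TO ", "", x))
  (lo + hi) / 2
}

storey_mid(c("01 TO 03", "10 TO 12"))  # 2 11
```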

Calculate Average Resale Price by Storey Range

This section calculates the average price for each storey range category.

# Average Price
average_prices_storey_range <- resale_flat_prices_2017 %>%
  group_by(storey_range) %>%
  summarise(average_price = mean(resale_price, na.rm = TRUE))

average_prices_storey_range

Plot

ggplot(average_prices_storey_range, aes(x = storey_range, y = average_price, fill = average_price)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Storey Range", y = "Average Resale Price", title = "Average Resale Price by Storey Range") +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +  
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1, size = 10)) +
  theme(axis.title = element_text(size = 12, face = "bold"),
        plot.title = element_text(size = 14, face = "bold"),
        legend.position = "none") +  
  guides(fill = "none")  

Observation

There seems to be a clear trend of resale prices increasing on the upper floors. We next bring in flat type to sieve out further insights.

Aggregate the Data by Storey Range and Flat Type

This section sorts the data by flat type too.

average_prices_storey_flat <- resale_flat_prices_2017 %>%
  group_by(storey_range, flat_type) %>%
  summarise(average_price = mean(resale_price, na.rm = TRUE))
## `summarise()` has grouped output by 'storey_range'. You can override using the
## `.groups` argument.
average_prices_storey_flat

Plot

ggplot(average_prices_storey_flat, aes(x = storey_range, y = average_price, fill = flat_type)) +
  geom_bar(stat = "identity", position = "dodge", color = "black") + 
  scale_fill_brewer(palette = "Set3") +  
  labs(x = "Storey Range", y = "Average Resale Price", title = "Average Resale Price by Storey Range and Flat Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1),  
        axis.title = element_text(size = 12, face = "bold"),  
        plot.title = element_text(size = 14, face = "bold"),  
        legend.position = "bottom",  
        legend.title = element_blank())  

Observation

The grouped bar graph makes it hard to compare across flat types. We next use a facet wrap to plot a separate panel for each flat type.

Create a Numeric Mapping for storey_range

average_prices_storey_flat$storey_range_numeric <- as.numeric(factor(average_prices_storey_flat$storey_range))

Plot

ggplot(average_prices_storey_flat, aes(x = storey_range, y = average_price, fill = storey_range_numeric)) +
  geom_bar(stat = "identity", position = "dodge", color = "black") + 
  scale_fill_gradient(low = "lightblue", high = "darkblue") + 
  labs(x = "Storey Range", y = "Average Resale Price", title = "Average Resale Price by Storey Range, Faceted by Flat Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),  
        axis.title = element_text(size = 12, face = "bold"),  
        plot.title = element_text(size = 14, face = "bold")) +  
  facet_wrap(~ flat_type, scales = "free_y", ncol = 1) +  
  guides(fill = "none") +
  scale_y_continuous(labels = scales::number_format())  

Insights

Multi-generation and one-room flats have a low storey range of only up to 12 floors. One would expect resale price to rise with storey range; however, beyond a certain point, usually around the top 10% of the storey range for that flat type, resale prices fall, as seen for the 2-room, 3-room, 5-room and executive flats.





Research Question 3 - Is there a relationship between location and flat type?

Approach

  1. Aggregate the data by town and flat type to obtain the average resale price.
  2. Reshape the data
  3. Plot

Aggregate the Data

This section aggregates the data from the original dataset.

average_resale_prices_town_flat_type <- resale_flat_prices_2017 %>%
  group_by(town, flat_type) %>%
  summarise(average_price = mean(resale_price, na.rm = TRUE))
## `summarise()` has grouped output by 'town'. You can override using the
## `.groups` argument.
average_resale_prices_town_flat_type

Reshape the Data for Heatmap

This section reshapes the data for a heatmap.

heatmap_data <- dcast(average_resale_prices_town_flat_type, town ~ flat_type, value.var = "average_price")

heatmap_data
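For reference, the same wide reshape can be done with tidyr::pivot_wider, the modern successor to reshape2::dcast; the toy tibble below is made up, with column names mirroring the report's.

```r
library(tidyr)
library(tibble)

# Toy long-format data: one average price per town/flat-type pair
toy <- tibble(
  town          = c("ANG MO KIO", "ANG MO KIO", "BEDOK"),
  flat_type     = c("3 ROOM", "4 ROOM", "3 ROOM"),
  average_price = c(350000, 480000, 360000)
)

# Spread flat types into columns: one row per town
wide <- pivot_wider(toy, names_from = flat_type, values_from = average_price)
wide
```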

Plot

# Melt the data back to a long format for ggplot2
heatmap_data_long <- melt(heatmap_data, id.vars = "town", variable.name = "flat_type", value.name = "average_price")

#Plot
ggplot(heatmap_data_long, aes(x = flat_type, y = town, fill = average_price)) +
  geom_tile(color = "white", linewidth = 0.1) +  # Add border to each tile for better separation
  scale_fill_gradientn(colors = RColorBrewer::brewer.pal(9, "Blues"), 
                       labels = label_number(big.mark = ",", decimal.mark = ".", accuracy = 1)) +
  labs(x = "Flat Type", y = "Town", fill = "Average Price", title = "Average Resale Prices by Town and Flat Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),  # Adjust for better readability
        axis.text.y = element_text(size = 8),  # Adjust size for better readability if needed
        plot.title = element_text(hjust = 0.5, size = 14),  # Center and size the title
        legend.title = element_text(size = 10),  # Adjust legend title size
        legend.text = element_text(size = 8))  # Adjust legend text size

Insights

Darker colors indicate higher average resale prices, while lighter colors represent lower, more affordable ones. From the heatmap, it can be deduced that location does matter for the resale price of the same flat type, as seen in how the Central Area and Queenstown outperform their peers for each flat type.





Q2A - Visualising the distribution of recent HDB resale prices (4 room) with a joyplot

Recreate the ridgeline plot showing the distribution of HDB resale prices of 4-room units by neighbourhood in price per square metre, in 2022.

Approach

  1. Calculate the price per square metre (new column) for each transaction.
  2. Sort for only 4-room transactions in 2022.
  3. Plot

Calculate price per square metre

# Calculate price per square meter
resale_flat_prices_2017$price_per_sqm <- resale_flat_prices_2017$resale_price / resale_flat_prices_2017$floor_area_sqm

Sort

#Sort for only 4-room transactions in 2022.
resale_2022_4rm <- resale_flat_prices_2017 %>%
  filter(flat_type == "4 ROOM" & year(month) == 2022)

# Check
resale_2022_4rm 
resale_2022_4rm$price_per_sqm <- as.numeric(as.character(resale_2022_4rm$price_per_sqm))

Plot

ggplot(resale_2022_4rm, aes(x = price_per_sqm, y = reorder(town, -price_per_sqm, FUN = median))) +
  geom_density_ridges_gradient(
    aes(fill = after_stat(x)), 
    scale = 3, rel_min_height = 0.01, gradient_lwd = 1
  ) +
  scale_fill_viridis_c(option = "C") +  
  labs(title = "HDB resale prices (4 room) in 2022 by neighbourhood",
       subtitle = "Neighbourhoods exhibit large differences",
       x = "Price per square metre ($)",
       y = "") +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
## Picking joint bandwidth of 304

ggplot(resale_2022_4rm, aes(x = price_per_sqm, y = reorder(town, -price_per_sqm, FUN = median))) +
  geom_density_ridges(alpha=0.6, stat="binline", bins=20) +
  scale_fill_viridis_c(option = "C", guide = "none") +  # Hide the legend
  labs(title = "HDB resale prices (4 room) in 2022 by neighbourhood",
       subtitle = "Neighbourhoods exhibit large differences",
       x = "Price per square metre ($)",
       y = "") +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

Plot Comments

We tried another variation of the ridgeline plot (binned histogram ridgelines), but it does not read as well as the gradient density ridges.

Observations

Most prices per square metre fall around the $5,000 mark, as indicated by the majority of the plot being purple. Comparing the heights of the ridgelines across towns, towns further from the city centre (e.g. Woodlands, Pasir Ris) have more transactions than areas nearer the city centre such as the Central Area and Queenstown. Furthermore, the sharp, single peaks of the towns further out indicate low variation in their price distributions, unlike areas nearer the city centre, which show multiple peaks or a more gradual spread of the ridges across the price range.





Q2B - Visualising the temporal change of HDB resale prices

Create a visual comparing prices between different years.

Approach

Comparison of HDB resale prices (4 room) between 2018 and 2023 by neighbourhood.

  1. (Previously done) Calculate the price per square metre (new column) for each transaction.
  2. Sort for only 4-room transactions from 2018 - 2023.
  3. Plot

Sort

# Sort for only 4-room transactions from 2018 to 2023.
resale_2018_2023_4rm <- resale_flat_prices_2017 %>%
  filter(flat_type == "4 ROOM" & year(month) >= 2018 & year(month) <= 2023)

# Check
head(resale_2018_2023_4rm)
# Calculate median prices for 2018 and 2023
median_prices <- resale_2018_2023_4rm %>%
  group_by(town, year = year(month)) %>%
  filter(year == 2018 | year == 2023) %>%
  summarise(median_price_per_sqm = median(price_per_sqm), .groups = 'drop') %>%
  spread(key = year, value = median_price_per_sqm)

# Check
head(median_prices)

Plot

median_prices_long <- median_prices %>%
  pivot_longer(cols = c(`2018`, `2023`), names_to = "Year", values_to = "Price")

# Plot
ggplot(median_prices_long, aes(x = Price, y = reorder(town, Price), color = Year)) +
  geom_segment(data = median_prices, aes(x = `2018`, xend = `2023`, y = town, yend = town), color = "grey") +
  geom_point(size = 3) +  # Place the geom_point layer after geom_segment
  scale_color_manual(values = c(`2018` = "lightblue", `2023` = "darkblue")) +
  scale_x_continuous(name = "Price per Square Metre ($)", labels = scales::dollar) +
  labs(title = "Comparison of HDB resale prices (4 room) between 2018 and 2023",
       subtitle = "Analysis by neighbourhood") +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  guides(color = guide_legend(title = "Year")) 





Q3 - Freeform Task

Dataset Information

For the freeform task, I have decided to look at a chess dataset to unearth patterns and relationships, since chess is a hobby of mine. The dataset can be downloaded from https://www.kaggle.com/datasets/datasnaek/chess.





Summary of Analysis

Research Question 1

3.1 - What are the favourite chess openings for the different player classes?

Plot Used

Stacked Barchart

Insight

It can be concluded that the Sicilian Defence is a highly popular chess opening within all player classes. Thus, you might want to learn the three openings favoured by the pros: the Sicilian Defence, the French Defence and the Caro-Kann Defence, for a higher probability of winning.

Research Question 2

3.2 - Does white or black really give you an advantage in winning?

Plot Used

Barchart

Insight

Based on the analysis of the game outcomes alone, it can be deduced that playing as white confers a small advantage in winning probability, most likely due to the first-move advantage.

Research Question 3

3.3 - Which chess opening should White use?

Plot Used

Barchart

Insight

Therefore, to maximise the probability of winning, I will strive to play as white in games featuring the Sicilian Defence.





Import Data

The code below imports the downloaded data.

# Import Data
suppressMessages({
  chess <- read_csv(
    'data/games_metadata_profile_2024_01.csv'
    )
  
})
print('loaded')
## [1] "loaded"





EDA

We first begin with an exploratory data analysis to scope the dataset.

Preview

head(chess)

Shape of Dataset

dim(chess)
## [1] 130922     33

Observation

The dataset is quite large, at 130,922 entries.

Columns Names

names(chess)
##  [1] "GameID"               "Event"                "Round"               
##  [4] "Site"                 "Date"                 "Time"                
##  [7] "White"                "WhiteElo"             "WhiteRatingDiff"     
## [10] "White_is_deleted"     "White_tosViolation"   "White_profile_flag"  
## [13] "White_createdAt"      "White_playTime_total" "White_count_all"     
## [16] "White_title"          "Black"                "BlackElo"            
## [19] "BlackRatingDiff"      "Black_is_deleted"     "Black_tosViolation"  
## [22] "Black_profile_flag"   "Black_createdAt"      "Black_playTime_total"
## [25] "Black_count_all"      "Black_title"          "Moves"               
## [28] "TotalMoves"           "ECO"                  "Opening"             
## [31] "TimeControl"          "Termination"          "Result"

Observation

There are 33 possible features to work with.

Number of Chess Openings

# Number of Chess Openings
length(unique(chess$ECO))
## [1] 465

Observation

A large number of different chess openings is used, making this an interesting variable to work with. For reference, each chess opening is assigned an ECO code (e.g. B20 to B99 covers the Sicilian Defence).

Distribution of Total Moves

# Boxplot (horizontal, so the only axis label needed is for Total Moves)
boxplot(chess$TotalMoves, 
        horizontal = TRUE,
        main = "Total Moves in Chess Games", 
        xlab = "Total Moves")

Observation

Most games last about 50 to 75 moves, with the median at roughly 60 moves. There are quite a number of outliers beyond roughly 140 moves.
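To back the visual read with a number, one could compute the upper Tukey fence that boxplot() uses to flag outliers (a minimal sketch; the `moves` argument below takes a stand-in vector, not the actual chess$TotalMoves column):

```r
# Upper Tukey fence: Q3 + 1.5 * IQR, the threshold beyond which
# boxplot() draws individual points as outliers.
outlier_cutoff <- function(moves) {
  q <- quantile(moves, c(0.25, 0.75), na.rm = TRUE)
  unname(q[2] + 1.5 * (q[2] - q[1]))
}

# Illustrative stand-in for chess$TotalMoves
outlier_cutoff(c(40, 50, 60, 70, 80, 200))
```

In the report this would be called as `outlier_cutoff(chess$TotalMoves)` to check the "140+ moves" eyeball estimate.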

List of Chess Moves

chess$Moves[1]
## [1] "1. d4 { [%eval 0.13] [%clk 0:05:00] } 1... d5 { [%eval 0.27] [%clk 0:05:00] } 2. c4 { [%eval 0.0] [%clk 0:05:02] } 2... Nf6?! { [%eval 0.71] [%clk 0:04:58] } 3. cxd5?! { [%eval 0.4] [%clk 0:05:02] } 3... Qxd5?! { [%eval 0.83] [%clk 0:04:58] } 4. Nc3?! { [%eval 0.5] [%clk 0:05:02] } 4... Qa5?! { [%eval 0.91] [%clk 0:04:55] } 5. Bd2?! { [%eval 0.95] [%clk 0:05:02] } 5... c6?! { [%eval 1.54] [%clk 0:04:56] } 6. e4?! { [%eval 1.37] [%clk 0:05:01] } 6... Qb4? { [%eval 3.79] [%clk 0:04:57] } 7. Nf3? { [%eval 1.99] [%clk 0:04:37] } 7... Qxb2?? { [%eval 5.21] [%clk 0:04:55] } 8. Bc4?? { [%eval 2.18] [%clk 0:04:22] } 8... Qb6?? { [%eval 1.99] [%clk 0:04:46] } 9. O-O?? { [%eval 2.32] [%clk 0:04:12] } 9... Bg4?? { [%eval 2.82] [%clk 0:04:44] } 10. Be2? { [%eval 1.3] [%clk 0:03:28] } 10... Bxf3? { [%eval 1.21] [%clk 0:04:42] } 11. Bxf3? { [%eval 1.4] [%clk 0:03:29] } 11... Qxd4? { [%eval 1.44] [%clk 0:04:42] } 12. Ne2? { [%eval -0.64] [%clk 0:03:24] } 12... Qd8? { [%eval 1.61] [%clk 0:04:41] } 13. Qc2? { [%eval -1.26] [%clk 0:03:16] } 13... Nbd7? { [%eval -1.48] [%clk 0:04:37] } 14. Rfd1? { [%eval -1.25] [%clk 0:03:16] } 14... e6? { [%eval -0.89] [%clk 0:04:36] } 15. Ng3?! { [%eval -1.85] [%clk 0:02:32] } 15... Qc7?! { [%eval -1.78] [%clk 0:04:36] } 16. a4?! { [%eval -2.1] [%clk 0:02:10] } 16... h5?! { [%eval -1.58] [%clk 0:04:30] } 17. Ne2?! { [%eval -1.92] [%clk 0:02:02] } 17... Ne5?! { [%eval -1.9] [%clk 0:04:30] } 18. Bf4?! { [%eval -2.19] [%clk 0:01:59] } 18... Nxf3+?! { [%eval -1.49] [%clk 0:04:28] } 19. gxf3?! { [%eval -1.46] [%clk 0:02:00] } 19... e5?! { [%eval -1.41] [%clk 0:03:56] } 20. Be3?! { [%eval -1.43] [%clk 0:01:57] } 20... Be7?! { [%eval -1.32] [%clk 0:03:29] } 21. Ng3?! { [%eval -1.72] [%clk 0:01:56] } 21... g6?! { [%eval -1.78] [%clk 0:03:21] } 22. a5?! { [%eval -1.82] [%clk 0:01:37] } 22... Qc8?! { [%eval -1.0] [%clk 0:03:18] } 23. Bc5?! { [%eval -1.69] [%clk 0:01:05] } 23... Bxc5?! { [%eval -1.45] [%clk 0:02:45] } 24. Qxc5?! 
{ [%eval -1.56] [%clk 0:01:05] } 24... Qc7?! { [%eval -1.4] [%clk 0:02:26] } 25. Rd6?! { [%eval -2.24] [%clk 0:00:55] } 25... Qe7? { [%eval -0.1] [%clk 0:02:12] } 26. Rad1? { [%eval -1.12] [%clk 0:00:40] } 26... O-O?! { [%eval -0.45] [%clk 0:02:06] } 27. h3? { [%eval -2.14] [%clk 0:00:07] } 27... Rad8? { [%eval -2.22] [%clk 0:02:03] } 0-1"

Observation

Each cell in the ‘Moves’ column contains the full list of moves played by both players, interleaved with engine evaluations and clock times. This looks to be the most interesting column to work with, although it would require some pre-processing.
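As a hedged sketch of what that pre-processing might look like (the helper name `clean_moves` is my own, not part of the analysis), the annotation blocks and move numbers can be stripped with regular expressions:

```r
# Strip { [%eval ...] [%clk ...] } annotation blocks and move numbers
# like "1." / "1..." from a PGN-style move string, leaving the bare moves.
clean_moves <- function(pgn) {
  no_braces <- gsub("\\{[^}]*\\}", "", pgn)             # drop annotation blocks
  no_nums   <- gsub("\\d+\\.{1,3}\\s*", "", no_braces)  # drop move numbers
  trimws(gsub("\\s+", " ", no_nums))                    # collapse whitespace
}

clean_moves("1. d4 { [%eval 0.13] [%clk 0:05:00] } 1... d5 { [%eval 0.27] } 0-1")
```

Applied over chess$Moves, this would yield plain move sequences ready for tokenising.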

Study of the Chess Players

# Get unique values from the 'White' and 'Black' columns
unique_white <- unique(chess$White)
unique_black <- unique(chess$Black)

# Remove values present in 'White' from unique values in 'Black'
unique_black <- unique_black[!unique_black %in% unique_white]

# Sum up the lengths of unique values from both columns
total_unique <- length(unique_white) + length(unique_black)

total_unique
## [1] 202566

Observation

The number of unique players in this dataset is quite high, at 202,566. Analysing the interactions between these players might yield something interesting.
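The two-step de-duplication above can also be written as a single set union, which drops duplicates across both columns in one call (an equivalent sketch on toy stand-ins for chess$White and chess$Black):

```r
# Toy stand-ins for the White and Black player columns
white <- c("anna", "boris", "carl")
black <- c("boris", "dana")

# union() keeps each player once, regardless of which colour they played
length(union(white, black))
```

On the real columns this would be `length(union(chess$White, chess$Black))`.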

Distribution of Elo Rankings

# Plot boxplot for WhiteElo
ggplot(chess, aes(y = WhiteElo)) +
  geom_boxplot() +
  coord_flip() +
  labs(y = "White Elo", title = "Boxplot of White Elo Ratings")

summary(chess$WhiteElo)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     400    1318    1610    1618    1907    3233
# Plot boxplot for BlackElo
ggplot(chess, aes(y = BlackElo)) +
  geom_boxplot() +
  coord_flip() +
  labs(y = "Black Elo", title = "Boxplot of Black Elo Ratings")

summary(chess$BlackElo)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     400    1317    1613    1618    1907    3198
Observation

The majority of players fall within the Elo range of 1300 to 1907. With reference to the chess rating system (https://en.wikipedia.org/wiki/Chess_rating_system), most players are within Class D to Class A. However, the boxplot shows many outliers at both ends of the spectrum, presumably new players and very experienced players, with the highest rating at 3233, in ‘super grandmaster’ territory.

Thus, separating the dataset into three groups of players (beginners, amateurs and advanced players) might reveal insights into the differences in their decision making.

Extracting out each class of players

Beginner

# Extract rows 
chess_beginner <- chess[chess$WhiteElo < 1318 & chess$BlackElo < 1317, ]

# Number of Games
nrow(chess_beginner)
## [1] 29543

Amateurs

# Extract rows 
chess_amateur <- chess[(chess$WhiteElo > 1318 & chess$WhiteElo < 1907) | (chess$BlackElo > 1317 & chess$BlackElo < 1907), ]

# Number of Games
nrow(chess_amateur)
## [1] 72314

Pros

# Extract rows 
chess_pro <- chess[chess$WhiteElo > 1907 | chess$BlackElo > 1907, ]

# Number of Games
nrow(chess_pro)
## [1] 36848

Plot

# Get row counts of the dataframes
row_counts <- c(nrow(chess_beginner), nrow(chess_amateur), nrow(chess_pro))

# Names of the dataframes
df_names <- c("Beginner", "Amateur", "Pro")

# Create bar plot
barplot(row_counts, names.arg = df_names, xlab = "Player Class", ylab = "Number of Games", main = "Game Count by Player Class", col = "skyblue")

Observation

The original dataframe has been split into 3 categories of players, with the bulk of the games involving amateur players.





3.1 - What are the favourite chess openings for the different player classes?

Count Unique Openings

# Define a function to count unique openings for a dataframe
count_unique_openings <- function(df) {
  df %>%
    group_by(ECO) %>%
    summarise(Count = n())
}

# Apply the function to each dataframe
chess_beginner_counts <- count_unique_openings(chess_beginner)
chess_amateur_counts <- count_unique_openings(chess_amateur)
chess_pro_counts <- count_unique_openings(chess_pro)
# Number of Unique Openings for each Player Class

nrow(chess_beginner_counts)
## [1] 257
nrow(chess_amateur_counts)
## [1] 400
nrow(chess_pro_counts)
## [1] 459
chess_pro_counts

Map ECO Ranges to Opening Names

In this section we map the ECO codes to their opening names. The ECO ranges are referenced from: https://www.365chess.com/eco.php

# Define ECO ranges and corresponding opening names
eco_ranges <- data.frame(
  Start = c("A00", "A01", "A02", "A04", "A10", "A40", "A42", "A43", "A45", "A47", "A48", "A50", "A51", "A53", "A56", "A57", "A60", "A80", "B00", "B01", "B02", "B06", "B07", "B10", "B20", "C00", "C20", "C21", "C23", "C25", "C30", "C40", "C41", "C42", "C44", "C45", "C46", "C47", "C50", "C51", "C53", "C55", "C60", "D00", "D01", "D02", "D03", "D04", "D06", "D07", "D10", "D16", "D17", "D20", "D30", "D43", "D50", "D70", "D80", "E00", "E01", "E10", "E11", "E12", "E20", "E60"),
  End = c("A00", "A01", "A03", "A09", "A39", "A41", "A42", "A44", "A46", "A47", "A49", "A50", "A52", "A55", "A56", "A59", "A79", "A99", "B00", "B01", "B05", "B06", "B09", "B19", "B99", "C19", "C20", "C22", "C24", "C29", "C39", "C40", "C41", "C43", "C44", "C45", "C46", "C49", "C50", "C52", "C54", "C59", "C99", "D00", "D01", "D02", "D03", "D05", "D06", "D09", "D15", "D16", "D19", "D29", "D42", "D49", "D69", "D79", "D99", "E00", "E09", "E10", "E11", "E19", "E59", "E99"),
  Opening = c("Polish Opening", "Nimzovich-Larsen attack", "Bird's opening", "Reti Opening", "English Opening", "Queen's Pawn", "Modern defence, Averbakh system", "Old Benoni defence", "Queen's Pawn Game", "Queen's Indian defence", "King's Indian, East Indian defence", "Queen's pawn game", "Budapest defence", "Old Indian defence", "Benoni defence", "Benko gambit", "Benoni defence", "Dutch", "King's pawn opening", "Scandinavian Defence", "Alekhine's defence", "Robatsch Defence", "Pirc defence", "Caro-Kann defence", "Sicilian Defence", "French Defence", "King's pawn game", "Centre Game", "Bishop's opening", "Vienna game", "King's gambit", "King's knight opening", "Philidor's defence", "Petrov's defence", "King's pawn game", "Scotch game", "Three knights game", "Four knights, Scotch variation", "Italian Game", "Evans gambit", "Giuoco Piano", "Two knights defence", "Ruy Lopez (Spanish opening)", "Queen's pawn game", "Richter-Veresov attack", "Queen's pawn game", "Torre attack (Tartakower variation)", "Queen's Pawn Game", "Queen's Gambit", "Queen's Gambit Declined, Chigorin defence", "Queen's Gambit Declined Slav defence", "Queen's Gambit Declined Slav accepted, Alapin variation", "Queen's Gambit Declined Slav, Czech defence", "Queen's gambit accepted", "Queen's gambit declined", "Queen's Gambit Declined semi-Slav", "Queen's Gambit Declined", "Neo-Gruenfeld defence", "Gruenfeld defence", "Queen's Pawn Game", "Catalan, closed", "Queen's Pawn Game", "Bogo-Indian defence", "Queen's Indian defence", "Nimzo-Indian defence", "King's Indian defence")
)

# Function to map ECO codes to opening names
map_eco_to_opening <- function(df) {
  df <- df %>%
    mutate(ECO_Start = str_sub(ECO, 1, 3)) %>%
    rowwise() %>%
    mutate(Opening = list(eco_ranges %>%
                            filter(ECO_Start >= Start & ECO_Start <= End) %>%
                            select(Opening) %>%
                            pull(Opening))) %>%
    unnest(Opening) %>%
    select(-ECO_Start)
  df
}

# Apply the function to your dataframes
chess_beginner_counts <- map_eco_to_opening(chess_beginner_counts)
chess_amateur_counts <- map_eco_to_opening(chess_amateur_counts)
chess_pro_counts <- map_eco_to_opening(chess_pro_counts)

Total count for openings by player class

aggregate_counts <- function(df) {
  df %>%
    group_by(Opening) %>%
    summarise(TotalCount = sum(Count)) %>%
    arrange(desc(TotalCount))
}

chess_beginner_agg <- aggregate_counts(chess_beginner_counts)
chess_amateur_agg <- aggregate_counts(chess_amateur_counts)
chess_pro_agg <- aggregate_counts(chess_pro_counts)
chess_pro_agg 

Test Plot

ggplot(chess_amateur_agg, aes(x = reorder(Opening, TotalCount), y = TotalCount)) +
  geom_bar(stat = "identity") +
  coord_flip()

Observation

Too many openings, distribution is too wide.

Select only Top 10 Openings

# Select top 10 openings based on TotalCount for each class
top_beginner <- chess_beginner_agg %>%
  arrange(desc(TotalCount)) %>%
  top_n(10, TotalCount) %>%
  mutate(Class = 'Beginner')

top_amateur <- chess_amateur_agg %>%
  arrange(desc(TotalCount)) %>%
  top_n(10, TotalCount) %>%
  mutate(Class = 'Amateur')

top_pro <- chess_pro_agg %>%
  arrange(desc(TotalCount)) %>%
  top_n(10, TotalCount) %>%
  mutate(Class = 'Pro')

# Combine into a single dataframe
combined_top_openings <- bind_rows(top_beginner, top_amateur, top_pro)

# Fill in missing combinations with zero counts
combined_top_openings <- combined_top_openings %>%
  complete(Opening, Class, fill = list(TotalCount = 0)) %>%
  arrange(Opening, Class)
# Plot
ggplot(combined_top_openings, aes(x = Opening, y = TotalCount, fill = Class)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.7)) +
  coord_flip() +
  theme(axis.text.y = element_text(angle = 0, hjust = 1)) +
  labs(title = "Top 10 Chess Openings by Player Class",
       x = "Chess Opening",
       y = "Total Count") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal()

Observation

Need to normalise the counts, since the total number of games played per class differs.

# Select top 10 openings based on TotalCount for each class and normalise counts
top_beginner <- chess_beginner_agg %>%
  arrange(desc(TotalCount)) %>%
  top_n(10, TotalCount) %>%
  mutate(Class = 'Beginner',
         NormalizedCount = TotalCount / sum(TotalCount))  # share within the class's top 10

top_amateur <- chess_amateur_agg %>%
  arrange(desc(TotalCount)) %>%
  top_n(10, TotalCount) %>%
  mutate(Class = 'Amateur',
         NormalizedCount = TotalCount / sum(TotalCount))  # share within the class's top 10

top_pro <- chess_pro_agg %>%
  arrange(desc(TotalCount)) %>%
  top_n(10, TotalCount) %>%
  mutate(Class = 'Pro',
         NormalizedCount = TotalCount / sum(TotalCount))  # share within the class's top 10

# Combine into a single dataframe
combined_top_openings <- bind_rows(top_beginner, top_amateur, top_pro)

# Ensure each class has an entry for each opening by filling in missing combinations with zero counts
combined_top_openings <- combined_top_openings %>%
  complete(Opening, Class, fill = list(TotalCount = 0, NormalizedCount = 0)) %>%
  arrange(Opening, Class)
# Plot 
ggplot(combined_top_openings, aes(x = NormalizedCount, y = Opening, fill = Class)) +
  geom_bar(stat = "identity", position = "dodge", color = "white") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Top 10 Chess Openings by Player Class",
       subtitle = "What player class are you?",
       x = "Normalized Count") +
  theme_minimal() +
  theme(axis.line = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_text(hjust = 1),
        plot.title = element_text(hjust = 0, size = 16, face = "bold"),  
        plot.subtitle = element_text(hjust = 0, size = 12), 
        legend.title = element_blank())

Insight

It can be concluded that the Sicilian Defence is a highly popular chess opening within all player classes. Thus, you might want to learn the three openings favoured by the pros: the Sicilian Defence, the French Defence and the Caro-Kann Defence, for a higher probability of winning.





3.2 - Does white or black really give you an advantage in winning?

This section explores the hypothesis that moving first as white gives a significant advantage in winning.

Transform the Results Column

# Convert the 'Result' column into a more descriptive outcome
chess <- chess %>%
  filter(Result %in% c("1-0", "0-1", "1/2-1/2")) %>%
  mutate(Outcome = case_when(
    Result == "1-0" ~ "White Win",
    Result == "0-1" ~ "Black Win",
    Result == "1/2-1/2" ~ "Draw"
  ))

Count the match outcome

# Count the number of each outcome type
outcome_counts <- chess %>%
  group_by(Outcome) %>%
  summarise(Count = n())

Plot

# Plot
ggplot(outcome_counts, aes(x = Outcome, y = Count, fill = Outcome)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Chess Game Outcomes by Starting Position",
       x = "Outcome",
       y = "Count",
       fill = "Outcome") +
  geom_text(aes(label = Count), vjust = -0.3)

Observation

From raw counts alone, it is hard to tell the difference between black and white wins visually.

Convert to Percentages Instead

# Calculate the percentages 
outcome_counts <- chess %>%
  filter(Result %in% c("1-0", "0-1", "1/2-1/2")) %>%
  mutate(Outcome = case_when(
    Result == "1-0" ~ "White Win",
    Result == "0-1" ~ "Black Win",
    Result == "1/2-1/2" ~ "Draw"
  )) %>%
  group_by(Outcome) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = (Count / sum(Count)) * 100)

Plot

# Plot
ggplot(outcome_counts, aes(x = Outcome, y = Percentage, fill = Outcome)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = sprintf("%.2f%%", Percentage)), vjust = -0.5, size = 3.5) +
  scale_fill_manual(values = c("White Win" = "#1f77b4", "Black Win" = "#ff7f0e", "Draw" = "#2ca02c")) +
  theme_minimal() +
  labs(title = "Percentage of Chess Game Outcomes by Starting Position",
       x = "Outcome",
       y = "Percentage")

Observation

Visually, it is still difficult to see that white has a higher probability of winning than black.

Remove Draws from the graph

outcome_counts <- outcome_counts %>%
  filter(Outcome != "Draw") 

Plot

ggplot(outcome_counts, aes(x = Outcome, y = Percentage, fill = Outcome)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +  # Make the bar chart horizontal
  scale_fill_manual(values = c("Black Win" = "black", "White Win" = "white")) +
  labs(title = "Does the first turn matter?",
       subtitle = "An analysis of 130,922 game outcomes",
       x = "",  
       y = "Percentage (%)") +
  theme_minimal() +
  theme(axis.title.y = element_blank(),  
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        legend.title = element_blank(),
        plot.title = element_text(hjust = 0, size = 16, face = "bold"),  
        plot.subtitle = element_text(hjust = 0, size = 12),  
        plot.background = element_rect(fill = "#f0f0f0"),  
        panel.background = element_rect(fill = "#f0f0f0", color = NA))  

Insight

Based on the analysis of the game outcomes alone, it can be deduced that playing as white confers a small advantage in winning probability, most likely due to the first-move advantage.
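This insight rests on eyeballing two bars; a one-sided binomial test could make it quantitative. A minimal sketch with illustrative placeholder counts (not the dataset's actual figures; in the report the real counts would come from `outcome_counts`):

```r
# Hypothetical win counts among decisive games (placeholders only)
white_wins <- 64000
black_wins <- 61000

# H0: white wins half of decisive games; H1: white wins more than half
res <- binom.test(white_wins, white_wins + black_wins,
                  p = 0.5, alternative = "greater")
res$p.value
```

A small p-value here would support the claim that white's edge is more than sampling noise.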





3.3 - Which chess opening should White use?

Now that we have established that playing as white gives a slight advantage in winning, I want to know which chess openings I should use when playing as white.

Function to map ECO codes to opening names

We map the ECO codes to opening names, since the existing Opening column contains opening names in inconsistent formats.

# Function to map ECO codes to opening names
map_eco_to_opening <- function(ECO_code) {
  opening <- eco_ranges$Opening[which(ECO_code >= eco_ranges$Start & ECO_code <= eco_ranges$End)]
  if(length(opening) > 0) {
    return(opening[1])
  } else {
    return(NA)
  }
}

# Apply the function to create 'opening_new' column
chess$opening_new <- sapply(chess$ECO, map_eco_to_opening)
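Since `sapply` runs the lookup once per row, a vectorized alternative may be worth sketching for a dataset this size (my own sketch, not the report's method; it relies on the range starts being sorted, which the eco_ranges table above satisfies):

```r
# Convert an ECO code like "B20" to a sortable number (B -> 2, so 220)
eco_num <- function(x) match(substr(x, 1, 1), LETTERS) * 100 + as.numeric(substr(x, 2, 3))

# Vectorized range lookup via findInterval(); returns NA for codes
# that fall outside every range.
map_eco_fast <- function(eco, ranges) {
  idx <- findInterval(eco_num(eco), eco_num(ranges$Start))
  ok  <- idx > 0 & eco_num(eco) <= eco_num(ranges$End)[pmax(idx, 1)]
  ifelse(ok, ranges$Opening[pmax(idx, 1)], NA)
}

# Toy ranges for illustration (the real ones come from eco_ranges above)
toy <- data.frame(Start = c("B20", "C00"), End = c("B99", "C19"),
                  Opening = c("Sicilian Defence", "French Defence"))
map_eco_fast(c("B45", "C10", "A99"), toy)
```

On the full dataset this would be `chess$opening_new <- map_eco_fast(chess$ECO, eco_ranges)`, replacing the per-row sapply.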

Filter Winning Games

We filter for the games where white wins

# Filter
opening_win <- chess %>%
  filter(Result == '1-0')

head(opening_win)

Get Count

#Count
opening_counts <- opening_win %>%
  group_by(opening_new) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count))

Plot

# Plot
ggplot(opening_counts, aes(x = reorder(opening_new, Count), y = Count)) +
  geom_bar(stat = "identity") +
  coord_flip() +  
  labs(title = "Count of Chess Openings in Wins for White",
       x = "Opening",
       y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  

Observation

We can afford to remove the bottom few openings since they are rarely used.

Remove the bottom 25%

# Calculate the 25th percentile of the counts
threshold <- quantile(opening_counts$Count, 0.25)

# Filter out the openings in the bottom 25%
filtered_opening_counts <- opening_counts %>%
  filter(Count > threshold)

Plot

ggplot(filtered_opening_counts, aes(x = reorder(opening_new, Count), y = Count)) +
  geom_bar(stat = "identity", fill = "darkgrey") +
  coord_flip() +  
  labs(title = "Winning Chess Openings for White",
       subtitle = "Winning is Everything",
       x = "Opening",
       y = "") +  
  theme_minimal() +
  theme(axis.title.y = element_blank(),  
        plot.title = element_text(face = "bold", size = 14), 
        plot.subtitle = element_text(size = 10),  
        axis.text.x = element_text(hjust = 1))  

Insight

Therefore, to maximise the probability of winning, I will strive to play as white in games featuring the Sicilian Defence.

For Fun

For the sake of curiosity, I also plotted black.

Get Black Wins

# Filter
opening_win_b <- chess %>%
  filter(Result == '0-1')

#Count
opening_count_b <- opening_win_b %>%
  group_by(opening_new) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count))

# Calculate the 25th percentile of the counts
threshold <- quantile(opening_count_b$Count, 0.25)

# Filter out the openings in the bottom 25%
filtered_opening_count_b <- opening_count_b %>%
  filter(Count > threshold)

Plot

ggplot(filtered_opening_count_b, aes(x = reorder(opening_new, Count), y = Count)) +
  geom_bar(stat = "identity", fill = "black") +
  coord_flip() +  
  labs(title = "Winning Chess Openings for Black",
       subtitle = "Winning is Everything",
       x = "Opening",
       y = "") +  
  theme_minimal() +
  theme(axis.title.y = element_blank(),  
        plot.title = element_text(face = "bold", size = 14), 
        plot.subtitle = element_text(size = 10),  
        axis.text.x = element_text(hjust = 1))  

Observation

It looks like the Sicilian Defence tops the win counts for both colours.





Limitations

Note that different types of chess games (e.g. time controls) were played, and different types might lead to different playing styles and decisions, which could affect the moves and openings used.

A player interaction analysis would also be interesting.

I acknowledge that the code could be more efficient in terms of compile time and unnecessary variables, but it's 3 am and I'm really tired; I promise assignment 2's code will be more efficient.

End Notes

I hope you enjoyed the insights from the chess dataset, because I did. I shall now endeavour to start my matches as white and open with the Sicilian Defence.